Skip to content

feat: Replace mutable buffers with immutable Arrow vectors in NativeBatchReader#3382

Closed
andygrove wants to merge 11 commits into
apache:mainfrom
andygrove:immutable-constant-column-reader
Closed

feat: Replace mutable buffers with immutable Arrow vectors in NativeBatchReader#3382
andygrove wants to merge 11 commits into
apache:mainfrom
andygrove:immutable-constant-column-reader

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented Feb 3, 2026

Rationale

I would like to remove the remaining mutable buffer use from Comet so that we can use Arrow FFI best practices.

Summary

  • Add ImmutableConstantColumnReader that creates Arrow vectors directly in Java without using native Rust mutable buffers, used for partition columns and missing columns in NativeBatchReader
  • Support primitive types: Boolean, Byte, Short, Integer, Long, Float, Double, String, Binary, Date, Timestamp, Decimal, Null
  • CometScanRule checks partition column types at planning time and falls back to Spark if complex types (StructType, ArrayType, MapType) are used

Test plan

  • Existing tests pass
  • Manual testing with partition columns of various primitive types

🤖 Generated with Claude Code

andygrove and others added 5 commits February 3, 2026 11:22
…mentation

The documentation incorrectly claimed that native_iceberg_compat "removes the
use of reusable mutable-buffers". In reality, both native_comet and
native_iceberg_compat use reusable mutable buffers when transferring data
via Arrow FFI.

This commit:
- Removes the inaccurate claim
- Replaces it with accurate description of Parquet decoding delegation
- Adds a note explaining the actual mutable buffer behavior
- Links to the FFI documentation for details on arrow_ffi_safe flag

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Clarified note on mutable buffers and updated details on `native_iceberg_compat` implementation.
…ader

Add ImmutableConstantColumnReader that creates Arrow vectors directly
in Java without using native Rust mutable buffers. This is used for
partition columns and missing columns in NativeBatchReader.

Key changes:
- New ImmutableConstantColumnReader creates Arrow vectors using Arrow
  Java APIs, supporting primitive types (Boolean, Byte, Short, Integer,
  Long, Float, Double, String, Binary, Date, Timestamp, Decimal, Null)
- NativeBatchReader now uses ImmutableConstantColumnReader instead of
  ConstantColumnReader for partition and missing columns
- CometScanRule checks partition column types at planning time and
  falls back to Spark if complex types (StructType, ArrayType, MapType)
  are used, since ImmutableConstantColumnReader only supports primitives

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove changed the title Replace mutable buffers with immutable Arrow vectors in NativeBatchReader feat: Replace mutable buffers with immutable Arrow vectors in NativeBatchReader Feb 3, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 2.18579% with 179 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.38%. Comparing base (f09f8af) to head (6bdfc9e).
⚠️ Report is 922 commits behind head on main.

Files with missing lines Patch % Lines
...e/comet/parquet/ImmutableConstantColumnReader.java 0.00% 171 Missing ⚠️
...n/scala/org/apache/comet/rules/CometScanRule.scala 40.00% 4 Missing and 2 partials ⚠️
...va/org/apache/comet/parquet/NativeBatchReader.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3382      +/-   ##
============================================
+ Coverage     56.12%   59.38%   +3.25%     
- Complexity      976     1463     +487     
============================================
  Files           119      176      +57     
  Lines         11743    16358    +4615     
  Branches       2251     2728     +477     
============================================
+ Hits           6591     9714    +3123     
- Misses         4012     5288    +1276     
- Partials       1140     1356     +216     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andygrove andygrove marked this pull request as ready for review February 4, 2026 00:57
@andygrove
Copy link
Copy Markdown
Member Author

Moving this to draft until we have benchmarks to ensure there is no regression with this change

@andygrove andygrove marked this pull request as draft February 4, 2026 19:05
andygrove and others added 6 commits February 5, 2026 07:55
Instead of materializing full N-element Arrow arrays for partition and
missing columns, export 1-element arrays from JVM and expand them on
the native side using ScalarValue. This avoids O(N) memory allocation
and copying for constant columns.

- Add CometConstantVector: stores a single value, lazily creates a
  1-element Arrow vector for FFI export, returns constant for all rowIds
- Modify ImmutableConstantColumnReader to produce CometConstantVector
- Add CometConstantVector case in NativeUtil.exportBatch() to skip
  row count validation for 1-element vectors
- In scan.rs, detect 1-element arrays and expand via ScalarValue when
  actual_num_rows > 1; skip take() for scalar columns with selection
  vectors since constants are unaffected by row deletion

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds a benchmark that writes a partitioned parquet table and measures
scan performance with 1 and 5 partition columns. Tests both reading
data columns alongside partitions and reading partition columns
themselves. This exercises the CometConstantVector → native scalar
expansion path.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds a standalone benchmark that writes partitioned parquet tables and
measures scan performance with 1 and 5 partition columns. Tests both
reading data columns alongside partitions and reading partition columns
themselves. This exercises the CometConstantVector path where constant
columns are exported as 1-element Arrow arrays and expanded on the
native side.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove
Copy link
Copy Markdown
Member Author

This approach is too slow. I am going to try some other approaches.

@andygrove andygrove closed this Feb 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants